Integration of pathologic characteristics, genetic risk and lifestyle exposure for colorectal cancer survival assessment

The development of an effective survival prediction tool is key for reducing colorectal cancer mortality. Here, we apply a three-stage study to devise a polygenic prognostic score (PPS) for stratifying colorectal cancer overall survival. Leveraging two cohorts of 3703 patients, we first perform a genome-wide survival association analysis to develop eight candidate PPSs. Further using an independent cohort of 470 patients, we identify the PPS derived from 287 variants (i.e., PPS287) as achieving the optimal prediction performance [hazard ratio (HR) per SD = 1.99, P = 1.76 × 10⁻⁸], a result supported by additional tests in two external cohorts, with HRs per SD of 1.90 (P = 3.21 × 10⁻¹⁴; 543 patients) and 1.80 (P = 1.11 × 10⁻⁹; 713 patients). Notably, the detrimental impact of pathologic characteristics and genetic risk could be attenuated by a healthy lifestyle, yielding a 7.62% improvement in the 5-year overall survival rate. Therefore, our findings demonstrate the integrated contribution of pathologic characteristics, germline variants, and lifestyle exposure to the prognosis of colorectal cancer patients.

Furthermore, we performed expression quantitative trait loci (eQTL) analysis to evaluate the effects of significant variants on the expression of nearby genes (within a ±1 Mb region), using data from normal Colon-Sigmoid and Colon-Transverse tissues in the Genotype-Tissue Expression project (GTEx, https://www.gtexportal.org/home/).

Clumping and P value thresholding
The clumping and P value thresholding (C+T) approach is a classic method that calculates a PPS from a subset of partially independent (i.e., clumped) SNPs passing a specific GWAS association P value threshold 7,8. With the combined NJCRC and UK Biobank dataset as the linkage disequilibrium (LD) reference panel, and leveraging the summary statistics of the EAS-EUR meta-analysis for candidate SNPs (LD r² < 0.1), we used PLINK (version 1.90) to obtain three subsets of variants, setting the region size to 500 kb and applying different P value thresholds (i.e., 1.00 × 10⁻⁵, 1.00 × 10⁻⁴, and 0.001).
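As an illustration of the C+T idea, the following Python sketch performs a greedy clumping pass and then computes a score as a weighted sum of risk-allele dosages. It is not the PLINK implementation used in the study: the function names (`clump_and_threshold`, `polygenic_score`), the single-chromosome input, the reading of the 500 kb region size as a maximum distance from the index SNP, and all toy numbers are assumptions made for this example.

```python
import numpy as np

def clump_and_threshold(pos, pvals, r2, p_threshold,
                        window_bp=500_000, r2_max=0.1):
    """Greedy clumping on one chromosome (illustrative, not PLINK itself).

    pos   : (m,) base-pair positions
    pvals : (m,) association P values
    r2    : (m, m) pairwise LD r^2 from a reference panel
    Returns the sorted indices of the retained index SNPs.
    """
    pos, pvals, r2 = np.asarray(pos), np.asarray(pvals), np.asarray(r2)
    # Candidates pass the P value threshold and are visited from the
    # most to the least significant.
    candidates = [int(i) for i in np.argsort(pvals, kind="stable")
                  if pvals[i] < p_threshold]
    removed, kept = set(), []
    for i in candidates:
        if i in removed:
            continue
        kept.append(i)  # i becomes the index SNP of a new clump
        for j in candidates:
            close = abs(pos[j] - pos[i]) <= window_bp
            if j != i and close and r2[i, j] >= r2_max:
                removed.add(j)  # j is absorbed into the clump of i
    return sorted(kept)

def polygenic_score(dosages, weights):
    """Weighted sum of risk-allele dosages: one score per patient."""
    return np.asarray(dosages, dtype=float) @ np.asarray(weights, dtype=float)

# Toy data (hypothetical): four SNPs, SNP 1 is in LD with SNP 0.
pos = [100_000, 300_000, 900_000, 950_000]
pvals = [1e-6, 5e-5, 2e-4, 0.02]
r2 = np.array([[1.0, 0.8, 0.0, 0.0],
               [0.8, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.5],
               [0.0, 0.0, 0.5, 1.0]])

for threshold in (1e-5, 1e-4, 1e-3):
    print(threshold, clump_and_threshold(pos, pvals, r2, threshold))
```

Running the loop with the three thresholds mirrors how three SNP subsets of increasing size can arise from the same summary statistics.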

LASSO
The least absolute shrinkage and selection operator (LASSO) is a popular penalized regression method used with high-dimensional data to prevent overfitting; its L1 penalty shrinks some regression coefficients to exactly zero 9,10. The larger the value of the penalty parameter lambda (λ, i.e., the tuning parameter), the fewer predictors are selected.
We adopted a Cox regression model with a LASSO penalty to achieve shrinkage and variable selection simultaneously, using ten-fold cross-validation to determine the optimal value of lambda, implemented in the R package glmnet. The optimal lambda was selected via the 1-standard error (SE) criterion to determine the SNPs included for PPS construction. Finally, we constructed two LASSO-based PPSs, using weights derived from either the meta-analysis or the LASSO-penalized regression.
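The 1-SE criterion can be sketched in a few lines of Python. The example assumes that a grid of lambda values with the cross-validated mean error and its standard error is already available (in glmnet these correspond to the `lambda`, `cvm`, and `cvsd` components of a `cv.glmnet` fit); the function name `lambda_one_se` and the numbers are hypothetical.

```python
import numpy as np

def lambda_one_se(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation results.

    lambda_min minimises the mean cross-validated error; lambda_1se is
    the largest lambda whose mean error lies within one standard error
    of that minimum, which gives a sparser model.
    """
    lambdas = np.asarray(lambdas, dtype=float)
    cv_mean = np.asarray(cv_mean, dtype=float)
    cv_se = np.asarray(cv_se, dtype=float)
    i_min = int(np.argmin(cv_mean))
    limit = cv_mean[i_min] + cv_se[i_min]
    eligible = lambdas[cv_mean <= limit]
    return float(lambdas[i_min]), float(eligible.max())

# Hypothetical cross-validation curve (e.g. partial-likelihood deviance).
lambdas = [0.20, 0.10, 0.05, 0.02, 0.01]
cv_mean = [12.9, 12.4, 12.1, 12.0, 12.2]
cv_se = [0.3, 0.3, 0.3, 0.3, 0.3]

lam_min, lam_1se = lambda_one_se(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)
```

Because a larger lambda selects fewer predictors, choosing the 1-SE value instead of the minimising value trades a small loss in cross-validated fit for a shorter SNP list.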

Random survival forest (RSF)
The RSF method, an extension of Breiman's random forest, obtains bootstrap samples from the original cohort and grows a tree for each bootstrapped sample, using a splitting rule at each tree node that maximizes survival differences across daughter nodes 11. The process is repeated numerous times (number of trees = 2,000 in this study) so that a forest of trees is created. The importance of each variable was determined by variable importance (VIMP), derived from the difference between the out-of-bag (OOB) c-index of the original OOB data and that of the permuted OOB data, where variables with larger VIMP are considered more predictive; this was implemented in the R package randomForestSRC. We constructed candidate PPSs by adding SNPs in decreasing order of VIMP, and the PPS with the highest 5-year AUC in the validation dataset (i.e., TCGA cohort) was determined as optimal.
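The final selection step (adding SNPs in decreasing order of VIMP and keeping the best-scoring PPS) can be sketched as follows. This is a simplified illustration, not the randomForestSRC workflow: the VIMP values and per-SNP weights are taken as given, the names `best_prefix_by_vimp` and `binary_auc` are hypothetical, and a plain AUC on a binary 5-year status (ignoring censoring) stands in for the 5-year AUC used in the study.

```python
import numpy as np

def binary_auc(scores, events):
    """AUC as the probability that a patient with the event has a higher
    score than a patient without it (ties count one half)."""
    scores = np.asarray(scores, dtype=float)
    events = np.asarray(events, dtype=bool)
    diff = scores[events][:, None] - scores[~events][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def best_prefix_by_vimp(vimp, dosages, weights, events):
    """Add SNPs in decreasing VIMP order; keep the prefix with the best AUC."""
    dosages = np.asarray(dosages, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(-np.asarray(vimp, dtype=float), kind="stable")
    best_k, best_auc = 0, -np.inf
    for k in range(1, len(order) + 1):
        idx = order[:k]
        pps = dosages[:, idx] @ weights[idx]  # candidate PPS with k SNPs
        auc = binary_auc(pps, events)
        if auc > best_auc:  # ties keep the smaller SNP set
            best_k, best_auc = k, auc
    return order[:best_k].tolist(), best_auc

# Toy validation data (hypothetical): 4 patients, 3 SNPs.
dosages = [[2, 0, 0],
           [0, 0, 2],
           [0, 1, 0],
           [1, 2, 0]]
events = [1, 1, 0, 0]           # death within 5 years (assumed binary)
weights = [0.5, 0.1, 0.4]       # assumed per-allele log hazard ratios
vimp = [0.03, 0.001, 0.01]      # assumed VIMP values

selected, auc = best_prefix_by_vimp(vimp, dosages, weights, events)
print(selected, auc)
```

Keeping the smaller set on ties is a design choice of this sketch; the source does not state how ties were handled.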

CoxBoost
The CoxBoost method is a likelihood-based boosting algorithm for the Cox proportional hazards model 12. Likelihood-based boosting usually uses base learners that maximise an overall likelihood in each boosting step, selecting only the base learner with the largest increase in the likelihood. CoxBoost is used for models with numerical predictors and allows for mandatory covariates with unpenalized parameter estimates. We adopted a boosted Cox regression model for feature selection, using ten-fold cross-validation to determine the optimal number of boosting steps, implemented in the R package CoxBoost. After identifying the SNPs included for PPS construction, we constructed two CoxBoost-based PPSs, using weights derived from either the meta-analysis or the boosted regression.

Table 1 .
Summary of five lifestyle factors in the PLCO cohort. Note: BMI, body mass index; PLCO, Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial.
a Each lifestyle factor was given a score of 0 or 1, with 1 representing the healthy category. b Definition of healthy behaviour category.

Table 2 .
Summary of two suggestive genome-wide significant loci associated with colorectal cancer overall survival. Derived from Cox regression model, with the adjustment of corresponding covariates (NJCRC cohort: sex, age, smoking status, drinking status, stage, grade and top 10 principal components; UK Biobank cohort: sex, age, BMI, smoking status, drinking status and top 10 principal components). Combined results were obtained from meta-analysis. The P value is two-sided. f P value for heterogeneity test. The P value is two-sided. Note: BMI, body mass index.
survival-associated significant loci. The association of two loci with colorectal cancer risk, derived from meta-analysis of colorectal cancer GWAS in East Asian and European populations. The P value is two-sided.
d Risk allele frequency. c Risk/reference allele.

Table 4 .
Validation of the optimal polygenic prognostic score (i.e., PPS287) in the ZJCRC and PLCO cohorts. a Derived from the Cox regression model, with adjustment for the corresponding factors (ZJCRC: sex, age, smoking status, drinking status and top 10 principal components; PLCO: sex, age, smoking status, drinking status, research center, arm, stage, grade and top 10 principal components). The P value is two-sided. b Area under the time-dependent ROC curve. Note: PPS, polygenic prognostic score; PLCO, Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial; HR, hazard ratio; 95% CI, 95% confidence interval; ROC, receiver operating characteristic; SD, standard deviation.

Table 5 .
Sensitivity analysis for the association between polygenic prognostic score (i.e., PPS287) and colorectal cancer overall survival in the ZJCRC and PLCO cohorts. a Derived from the Cox regression model, with adjustment for the corresponding factors (ZJCRC: sex, age, smoking status, drinking status and top 10 principal components; PLCO: sex, age, smoking status, drinking status, research center, arm, stage, grade and top 10 principal components) when appropriate. The P value is two-sided. Note: PPS, polygenic prognostic score; PLCO, Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial; HR, hazard ratio; 95% CI, 95% confidence interval; SD, standard deviation.

Table 6 .
Univariate and multivariate analysis for the association of traditional risk factors and polygenic prognostic score with colorectal cancer overall survival in the ZJCRC cohort. Derived from the Cox regression model, with the additional adjustment of top 10 principal components. The P value is two-sided. Note: PPS, polygenic prognostic score.

Table 7 .
Univariate and multivariate analysis for the association of traditional risk factors and polygenic prognostic score with colorectal cancer overall survival in the PLCO cohort. Derived from the Cox regression model, with the additional adjustment of arm, research center and top 10 principal components. The P value is two-sided. b Combined clinical and pathologic stage (stage I, stage II, stage III and stage IV) for PLCO cohort. c G1, well differentiated; G2, moderately differentiated; G3, poorly differentiated; G4, undifferentiated. d Trend analysis. Note: PPS, polygenic prognostic score; PLCO, Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial.

Table 8 .
Performance comparison regarding 5-year survival prediction of different colorectal cancer prognostic models in the ZJCRC and PLCO cohorts. The traditional model included sex, age, smoking status and drinking status in the ZJCRC cohort, and sex, age, smoking status, drinking status, stage and grade in the PLCO cohort. The combined model included traditional factors and PPS. b AUC at 5-year survival. c Derived from the bootstrap method with 10,000 iterations. The P value is two-sided. Note: PLCO, Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial; PPS, polygenic prognostic score; ROC, receiver operating characteristic; AUC, area under the curve.
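A bootstrap comparison of two models' AUCs, of the kind mentioned in the Table 8 footnote, can be sketched in Python as below. The source does not describe the exact procedure, so several points are assumptions of this sketch: patients are resampled with replacement, the AUC is computed on a binary 5-year status rather than as a time-dependent AUC, and the two-sided P value comes from a normal approximation of the observed difference divided by the bootstrap standard deviation. The names `binary_auc` and `bootstrap_auc_test` and the toy risk scores are hypothetical.

```python
import numpy as np
from math import erfc, sqrt

def binary_auc(scores, events):
    """AUC on a binary outcome; ties between scores count one half."""
    scores = np.asarray(scores, dtype=float)
    events = np.asarray(events, dtype=bool)
    diff = scores[events][:, None] - scores[~events][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def bootstrap_auc_test(score_a, score_b, events, n_boot=10_000, seed=0):
    """Return (AUC_b - AUC_a, two-sided P) from a paired bootstrap."""
    score_a = np.asarray(score_a, dtype=float)
    score_b = np.asarray(score_b, dtype=float)
    events = np.asarray(events, dtype=bool)
    rng = np.random.default_rng(seed)
    observed = binary_auc(score_b, events) - binary_auc(score_a, events)
    diffs = []
    for _ in range(n_boot):
        # Resample patients; both models are scored on the same resample.
        idx = rng.integers(0, len(events), size=len(events))
        if events[idx].all() or not events[idx].any():
            continue  # AUC is undefined without both outcome classes
        diffs.append(binary_auc(score_b[idx], events[idx])
                     - binary_auc(score_a[idx], events[idx]))
    z = observed / np.std(diffs, ddof=1)
    return observed, float(erfc(abs(z) / sqrt(2)))  # normal two-sided P

# Toy data (hypothetical risk scores from two models for ten patients).
events = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
traditional = [0.6, 0.4, 0.7, 0.3, 0.5, 0.2, 0.6, 0.4, 0.5, 0.3]
combined = [0.9, 0.7, 0.8, 0.6, 0.4, 0.2, 0.5, 0.3, 0.7, 0.1]

delta, p_value = bootstrap_auc_test(traditional, combined, events, n_boot=2000)
print(delta, p_value)
```

Resampling patients rather than scores keeps the two models paired, so the correlation between their predictions is reflected in the bootstrap distribution of the difference.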